Exploiting parallelism to support scalable hierarchical clustering
Abstract
A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard TREC test collection, our parallel hierarchical clustering algorithm is shown to be scalable in terms of processors efficiently used and the collection size. Results show that our algorithm performs close to the expected O(n²/p) time on p processors, rather than the worst-case O(n³/p) time. Furthermore, the O(n²/p) memory complexity per node allows larger collections to be clustered as the number of nodes increases. While partitioning algorithms such as k-means are trivially parallelizable, our results confirm those of other studies showing that hierarchical algorithms produce significantly tighter clusters in the document clustering task. Finally, we show how our parallel hierarchical agglomerative clustering algorithm can be used as the clustering subroutine for a parallel version of the Buckshot algorithm to cluster the complete TREC collection at near theoretical runtime expectations.
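The abstract does not spell out how the work is divided among processors, so the following is only an illustrative sketch and not the authors' implementation. It assumes a round-robin assignment of similarity-matrix rows to p workers, simulates the workers in a single process, and uses the standard size-weighted (UPGMA-style) update for group average similarity. The function name `group_average_hac` and its parameters are hypothetical. Each worker finds the best pair among the rows it owns, and a global maximum over those local results stands in for a message passing reduction.

```python
def group_average_hac(sim, p=2):
    """Merge clusters by group average similarity until one remains.

    sim: symmetric n x n list of lists of pairwise similarities.
    p:   number of simulated workers (assumption: row i belongs to worker i % p).
    Returns a list of merges (kept_cluster, absorbed_cluster, similarity).
    """
    n = len(sim)
    sim = [row[:] for row in sim]          # work on a copy
    size = {i: 1 for i in range(n)}        # active cluster id -> member count
    merges = []

    while len(size) > 1:
        # Local step: each worker scans only the rows it owns.
        local_best = []
        for rank in range(p):
            best = None
            for i in size:
                if i % p != rank:
                    continue
                for j in size:
                    if j > i and (best is None or sim[i][j] > best[0]):
                        best = (sim[i][j], i, j)
            if best is not None:
                local_best.append(best)

        # Global step: stands in for a max-reduction across workers.
        s, i, j = max(local_best, key=lambda t: t[0])

        # Group average update: size-weighted mean of the two old rows.
        for k in size:
            if k != i and k != j:
                new = (size[i] * sim[i][k] + size[j] * sim[j][k]) / (size[i] + size[j])
                sim[i][k] = sim[k][i] = new

        size[i] += size[j]
        del size[j]
        merges.append((i, j, s))

    return merges


if __name__ == "__main__":
    toy = [
        [1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.3, 0.4],
        [0.1, 0.3, 1.0, 0.8],
        [0.2, 0.4, 0.8, 1.0],
    ]
    for kept, absorbed, s in group_average_hac(toy, p=2):
        print(kept, absorbed, round(s, 3))
```

In a real distributed memory setting each worker would store only its own rows instead of the whole matrix, which is where the per-node memory saving would come from. The sketch also rescans every owned row on each merge for simplicity; reaching the expected running time quoted in the abstract would presumably require a smarter per-row structure, which is not shown here.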
Similar resources
Hierarchical Multi-Threading For Exploiting Parallelism at Multiple Granularities
As we approach billion-transistor processor chips, the need for a new architecture to make efficient use of the increased transistor budget arises. Many studies have shown that significant amounts of parallelism exist at different granularities that is yet to be exploited. Architectures such as superscalar and VLIW use centralized resources, which prohibit scalability and hence the ability to make ...
MLPACK: a scalable C++ machine learning library
MLPACK is a new, state-of-the-art, scalable C++ machine learning library, which will be released in early December 2011. Its aim is to make large-scale machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users. MLPACK provides many cutting-edge alg...
An Assessment of a Metric Space Database Index to Support Sequence Homology
Global alignment of sequences based on simple-edit distance forms a metric space. Hierarchical metric space clustering methods have been commonly used to organize proteomes into taxonomies. Consequently, it is often anticipated that hierarchical clustering can be leveraged as a basis for scalable database index structures capable of managing the hyper-exponential growth of sequence data. M-tre...
Effective Parallelization Strategies for Scalable, High Performance Radio Frequency Ray Tracing
We present StingRay, an interactive environment for combined RF simulation and visualization based on ray tracing. StingRay is explicitly designed to support scalable, high performance simulation and visualization of RF energy propagation in complex urban environments using modern, highly parallel computer architectures. We explore three strategies for exploiting parallelism in StingRay and pro...
Parallelization of a Hierarchical Data Clustering Algorithm Using OpenMP
This paper presents a parallel implementation of CURE, an efficient hierarchical data clustering algorithm, using the OpenMP programming model. OpenMP provides a means of transparent management of the asymmetry and non-determinism in CURE, while our OpenMP runtime support enables the effective exploitation of the irregular nested loop-level parallelism. Experimental results for various problem ...
Journal title:
- JASIST
Volume 58
Publication date: 2007